Abstract

In certain applications it is useful to fit multinomial distributions to observed data with a penalty term 
that encourages sparsity. For example, in probabilistic latent audio source decomposition one may wish to 
encode the assumption that only a few latent sources are active at any given time. The standard heuristic
of applying an L1 penalty is not an option when fitting the parameters of a multinomial distribution,
which are constrained to sum to 1. An alternative is to use a penalty term that encourages low-entropy
solutions, which corresponds to maximum a posteriori (MAP) parameter estimation with an entropic 
prior. The lack of conjugacy between the entropic prior and the multinomial distribution complicates 
this approach. In this report I propose a simple iterative algorithm for MAP estimation of multinomial 
distributions with sparsity-inducing entropic priors. 



1 Introduction 

Suppose we want to estimate the parameter $\theta$ of a multinomial distribution responsible for generating $N$
observations $x_i \in \{1, \ldots, K\}$. The log-likelihood of the data is given by



$$\log p(x) = \sum_i \log \theta_{x_i} \tag{1}$$



and the maximum-likelihood estimate of $\theta$ is simply

$$\theta_k \propto \sum_i \mathbb{I}[x_i = k], \tag{2}$$



where $\mathbb{I}$ is an indicator function whose value is 1 if its argument is true and 0 if its argument is false.
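As a minimal sketch of equation (2), the maximum-likelihood estimate is just the normalized vector of category counts. The function name `multinomial_mle` and the zero-based labels {0, ..., K-1} (rather than {1, ..., K}) are assumptions made here for illustration.

```python
import numpy as np

def multinomial_mle(x, K):
    """Maximum-likelihood estimate of theta from integer observations.

    Assumes x holds zero-based labels in {0, ..., K-1}.
    """
    # counts[k] = sum_i I[x_i = k]
    counts = np.bincount(x, minlength=K).astype(float)
    # Normalizing turns the proportionality in equation (2) into an equality.
    return counts / counts.sum()

x = np.array([0, 2, 2, 1, 2, 0])   # toy data, N = 6
theta_ml = multinomial_mle(x, K=4)
print(theta_ml)
```

Note that a category that never occurs in the data receives an estimate of exactly zero, regardless of any prior belief about sparsity.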

The maximum-likelihood estimate may not be optimal if we have a priori knowledge that leads us to
believe that $\theta$ is sparse. For example, if $\theta$ indicated the relative loudness of a set of 88 piano notes at
a particular moment (as it might in an application of Probabilistic Latent Component Analysis to audio
spectrograms [1]), then we might expect only a few elements of $\theta$ to be much greater than 0. This would
correspond to the intuition that pianists rarely mash the entire piano keyboard at once.

To incorporate this prior intuition into our analysis, we might add a penalty term to our log-likelihood
function that encourages sparse settings of $\theta$. A common heuristic for inducing sparsity in optimization
problems is to introduce an L1 penalty term into the cost function (for example, in lasso regression [2]).
This is not an option here, since the L1 norm of $\theta$ is constrained to be 1. A natural alternative is to include
a negative entropy term in the log-likelihood function, corresponding to placing an unnormalized sparse
entropic prior on $\theta$:

$$\log p(\theta, x) = \text{constant} + a \sum_k \theta_k \log \theta_k + \sum_i \log \theta_{x_i}. \tag{3}$$

The constant $a$ controls the strength of the prior $p(\theta) \propto \exp\{a \sum_k \theta_k \log \theta_k\}$. If $a$ is positive, then this prior
will give higher weight to low-entropy settings of $\theta$.
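To make the effect of the prior concrete, the sketch below evaluates equation (3) without its constant. The function name `log_joint`, the use of a count vector in place of the raw observations, and the convention 0 log 0 = 0 are assumptions made here for illustration.

```python
import numpy as np

def log_joint(theta, counts, a):
    """Unnormalized log p(theta, x) from equation (3), dropping the constant.

    counts[k] = sum_i I[x_i = k], so sum_i log theta_{x_i} = sum_k counts[k] * log theta_k.
    """
    theta = np.asarray(theta, dtype=float)
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_theta = np.log(theta)
        # Convention 0 * log(0) = 0 in both terms.
        neg_entropy = np.sum(np.where(theta > 0, theta * log_theta, 0.0))
        log_lik = np.sum(np.where(counts > 0, counts * log_theta, 0.0))
    return a * neg_entropy + log_lik

no_data = np.zeros(4)                        # isolate the prior term
uniform = np.full(4, 0.25)                   # maximum-entropy setting
peaked = np.array([0.97, 0.01, 0.01, 0.01])  # low-entropy setting

prior_uniform = log_joint(uniform, no_data, a=1.0)
prior_peaked = log_joint(peaked, no_data, a=1.0)
print(prior_uniform, prior_peaked)
```

With a positive value of a, the peaked vector scores higher than the uniform one, which is the sense in which the prior favors sparse settings.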

Unfortunately, the Maximum A Posteriori (MAP) estimate of $\theta$ does not have a simple analytic form
for this model, since the entropic prior is not conjugate to the multinomial distribution. In the following
section, I propose a simple iterative scheme for MAP estimation of $\theta$ when $a$ is positive.
2 A MAP Inference Scheme for the Sparse Entropic Prior

Our strategy is based on optimizing the following approximate auxiliary function for the negative entropy
term in equation (3):

$$\ell(a, \nu, \theta, \alpha) = a \sum_k \alpha_k (\nu \log \theta_k - (\nu - 1) \log \alpha_k), \tag{4}$$

where $\alpha$ is a free parameter such that $\sum_k \alpha_k = 1$ and $\alpha_k > 0$, and $\nu$ is a real-valued scalar constrained to
be greater than 1. Taking the derivative of the Lagrangian of $\ell$ with respect to $\alpha_k$ yields

$$\frac{\partial \ell}{\partial \alpha_k} = a \nu \log \theta_k - a (\nu - 1)(1 + \log \alpha_k) + \lambda. \tag{5}$$

Setting the right side equal to zero shows that $\ell$ is optimized with respect to $\alpha$ when

$$\alpha_k \propto \exp\left\{\frac{\nu}{\nu - 1} \log \theta_k\right\} = \theta_k^{\nu/(\nu - 1)}. \tag{6}$$
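The step from equation (5) to equation (6) can be spelled out as follows. The normalizer $Z$ is notation introduced here for illustration; the last line gives the value of the auxiliary function at the optimal $\alpha$, obtained by substituting the solution back into equation (4).

```latex
\begin{align*}
0 &= a\nu\log\theta_k - a(\nu-1)(1+\log\alpha_k) + \lambda \\
\log\alpha_k &= \frac{\nu}{\nu-1}\log\theta_k + \frac{\lambda}{a(\nu-1)} - 1 \\
\alpha_k &= \frac{\theta_k^{\nu/(\nu-1)}}{Z},
  \qquad Z = \sum_j \theta_j^{\nu/(\nu-1)} \\
\ell(a,\nu,\theta,\alpha)\Big|_{\alpha\ \text{optimal}} &= a(\nu-1)\log Z
\end{align*}
```

The terms that do not depend on $k$ are absorbed by the normalization. Because the second derivative of equation (4) with respect to $\alpha_k$ is $-a(\nu-1)/\alpha_k$, which is negative for $a > 0$ and $\nu > 1$, this stationary point is a maximum.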

When $\nu$ is large, this implies that the optimal value of $\ell$ can only be achieved when $\alpha_k \approx \theta_k$.
When $\alpha_k = \theta_k$, we recover the original entropic prior term:

$$\ell(a, \nu, \theta, \theta) = a \sum_k \theta_k (\nu \log \theta_k - (\nu - 1) \log \theta_k) = a \sum_k \theta_k \log \theta_k. \tag{7}$$

Thus, for sufficiently large values of $\nu$, when $\alpha$ is optimally chosen $\ell$ approximates the entropic prior. We
may therefore substitute $\ell$ (with a large value of $\nu$) for the entropic prior term in equation (3) and jointly
optimize the approximate objective

$$C = a \sum_k \alpha_k (\nu \log \theta_k - (\nu - 1) \log \alpha_k) + \sum_i \log \theta_{x_i} \tag{8}$$

over $\alpha$ and $\theta$. When the gradient of $C$ with respect to $\alpha$ is 0, as it must be at a local optimum of $C$, $\alpha_k \approx \theta_k$,
and so $C \approx \log p(x, \theta)$, the objective function of interest.
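The quality of this approximation can be checked numerically. The sketch below evaluates the auxiliary function of equation (4) at the optimal $\alpha$ from equation (6) and compares it with the entropic prior term for increasing $\nu$. The helper names `aux` and `optimal_alpha` and the test vector are assumptions chosen for illustration.

```python
import numpy as np

def aux(a, nu, theta, alpha):
    """Auxiliary function from equation (4)."""
    return a * np.sum(alpha * (nu * np.log(theta) - (nu - 1.0) * np.log(alpha)))

def optimal_alpha(theta, nu):
    """Maximizer of aux over alpha, from equation (6), normalized to sum to 1."""
    alpha = theta ** (nu / (nu - 1.0))
    return alpha / alpha.sum()

theta = np.array([0.6, 0.25, 0.1, 0.05])    # arbitrary point on the simplex
a = 1.0
target = a * np.sum(theta * np.log(theta))  # entropic prior term in equation (3)

nus = [2.0, 10.0, 100.0, 1000.0]
gaps = [aux(a, nu, theta, optimal_alpha(theta, nu)) - target for nu in nus]
for nu, gap in zip(nus, gaps):
    print(f"nu = {nu:7.1f}   gap = {gap:.6f}")
```

The gap is never negative, since setting $\alpha = \theta$ is always feasible and recovers the prior term exactly, and it shrinks toward zero as $\nu$ grows.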

A simple fixed-point iteration can be used to optimize $C$ over $\alpha$ and $\theta$. The gradient of the Lagrangian
with respect to $\theta_k$ is

$$\frac{\partial C}{\partial \theta_k} = \frac{1}{\theta_k} \left( a \alpha_k \nu + \sum_i \mathbb{I}[x_i = k] \right) + \lambda, \tag{9}$$



where $\mathbb{I}$ is an indicator function whose value is 1 if its argument is true and 0 if its argument is false. $C$ is
therefore maximized with respect to $\theta$ when

$$\theta_k \propto a \alpha_k \nu + \sum_i \mathbb{I}[x_i = k]. \tag{10}$$

As observed above, $C$ is maximized with respect to $\alpha$ when

$$\alpha_k \propto \theta_k^{\nu/(\nu - 1)}. \tag{11}$$

By iterating between the updates in equations (10) and (11) we reach a stationary point of $C$. At such a
stationary point, $C \approx \log p(x, \theta)$, and so we may conclude that the value of $\theta$ at a stationary point of $C$
yields approximately a local optimum of $\log p(x, \theta)$.
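The alternating updates of equations (10) and (11) can be sketched as follows. The function name `entropic_map`, the initialization at the maximum-likelihood estimate, the stopping rule, and the default values of `nu`, `n_iter`, and `tol` are all assumptions made here for illustration; the text does not specify them.

```python
import numpy as np

def entropic_map(counts, a, nu=50.0, n_iter=10000, tol=1e-12):
    """Approximate MAP estimate of theta under the entropic prior.

    counts[k] = sum_i I[x_i = k]; a > 0 is the prior strength; nu > 1.
    """
    counts = np.asarray(counts, dtype=float)
    theta = counts / counts.sum()          # assumed starting point: the MLE
    exponent = nu / (nu - 1.0)
    for _ in range(n_iter):
        # Equation (11): alpha_k proportional to theta_k^(nu / (nu - 1)).
        alpha = theta ** exponent
        alpha /= alpha.sum()
        # Equation (10): theta_k proportional to a * alpha_k * nu + counts_k.
        new_theta = a * nu * alpha + counts
        new_theta /= new_theta.sum()
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta

counts = np.array([5.0, 3.0, 1.0, 1.0])    # toy data, N = 10
theta_ml = counts / counts.sum()
theta_map = entropic_map(counts, a=2.0)
print(theta_ml)
print(theta_map)
```

On this toy example the estimate moves mass from the rarest categories toward the most frequent one, which is the sparsifying behavior the prior is meant to produce. With this initialization a category with a zero count stays at exactly zero.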

Note that these updates may require a number of iterations to converge, and the number of iterations
needed is likely to grow with $\nu$. However, the cost of each update is minimal. If these updates are
incorporated as part of a larger coordinate ascent algorithm like the expectation-maximization algorithm used in
probabilistic latent semantic indexing [3], the additional expense involved in iterating between updating $\theta$
and $\alpha$ is likely to be dominated by the cost of computing $\sum_i \mathbb{I}[x_i = k]$ (or its expected value).






3 Conclusion 



I have presented a simple fixed-point iteration for performing approximate maximum a posteriori estimation
of multinomial parameters in the presence of a sparsity-inducing entropic prior. This algorithm only provides
an approximate solution, but it can be made arbitrarily accurate by increasing $\nu$, at the cost of slower
convergence. The algorithm is very easy to implement, and the cost per iteration is minimal.